Abstract

Markov decision processes (MDPs) and simple stochastic games (SSGs) provide a rich mathematical
framework to study many important problems related to probabilistic systems. MDPs and SSGs with
finite-horizon objectives, where the goal is to maximize the probability to reach a target state
in a given finite time, are a classical and well-studied problem. In this work we consider the
strategy complexity of finite-horizon MDPs and SSGs. We show that for all $\epsilon > 0$, the
natural class of counter-based strategies requires at most $\log\log(1/\epsilon) + n + 1$ memory
states, and memory of size $\Omega(\log\log(1/\epsilon) + n)$ is required, for
$\epsilon$-optimality, where $n$ is the number of states of the MDP (resp. SSG). Thus our bounds
are asymptotically optimal. We then study the periodic property of optimal strategies, and show a
sub-exponential lower bound on the period for optimal strategies.

1 Introduction 

Markov decision processes and simple stochastic games. The class of Markov decision processes
(MDPs) is a classical model for probabilistic systems that exhibit both stochastic and
deterministic behavior [5]. MDPs have been widely used to model and solve control problems for
stochastic systems [3]: there, non-determinism represents the freedom of the controller to choose
a control action, while the probabilistic component of the behavior describes the system response
to control actions. Simple stochastic games (SSGs) enrich MDPs by allowing two types of
non-determinism (angelic and demonic non-determinism) along with stochastic behavior [1]. MDPs
and SSGs provide a rich mathematical framework to study many important problems related to
probabilistic systems.

Finite-horizon objective. One classical problem widely studied for MDPs and SSGs is the
finite-horizon objective. In a finite-horizon objective, a finite time horizon $T$ is given and
the goal of the player is to maximize the payoff within the time horizon $T$ in MDPs (in SSGs
against all strategies of the opponent). The complexity of MDPs and SSGs with finite-horizon
objectives has been well studied, with book chapters dedicated to them [HE]. The complexity
results basically show that iterating the Bellman equation for $T$ steps yields the desired result
[HE]. While the computational complexity has been well studied, perhaps surprisingly the strategy
complexity has not received great attention. In this work we consider several problems related to
the strategy complexity of MDPs and SSGs with finite-horizon objectives, where the objective is to
reach a target state within a finite time horizon $T$.

Our contribution. In this work we consider the memory requirement for $\epsilon$-optimal
strategies, for $\epsilon > 0$, and a periodic property of optimal strategies in finite-horizon
MDPs and SSGs. A strategy is an $\epsilon$-optimal strategy, for $\epsilon > 0$, if the strategy
ensures within $\epsilon$ of the optimal value against all strategies of the opponent. For
finite-horizon objectives, the natural class of strategies is the class of counter-based
strategies, which have a counter to count the number of time steps. Our first contribution is to
establish asymptotically optimal memory bounds for $\epsilon$-optimal counter-based strategies, for
$\epsilon > 0$, in finite-horizon MDPs and SSGs. We show that $\epsilon$-optimal counter-based
strategies require at most memory of size $\log\log(1/\epsilon) + n + 1$ and memory of size
$\Omega(\log\log(1/\epsilon) + n)$ is required, where $n$ is the size of the state space. Thus our
bounds are asymptotically optimal. The upper bound holds for SSGs and the lower bound is for MDPs.
We then consider the periodic (or regularity) property of optimal strategies. The period of a
strategy is the number $P$ such that the strategy repeats within every $P$ steps (i.e., it is
periodic with time step $P$). We show a sub-exponential lower bound on the period of optimal
strategies for MDPs with finite-horizon objectives, by presenting a family of MDPs with $n$ states
where all optimal strategies are periodic and the period is $2^{\Omega(\sqrt{n \log n})}$.

Organization of the paper. The paper is organized as follows: In Section [2] we present all
the relevant definitions related to stochastic games and strategies. In Section [3] we show that
$\Theta(n + \log\log\epsilon^{-1})$ bits are necessary and sufficient for $\epsilon$-optimal
counter-based strategies, for all $\epsilon > 0$, in both finite-horizon MDPs and SSGs. In
Section [4] we show that there are finite-horizon MDPs where all optimal strategies are periodic
and have a period of $2^{\Omega(\sqrt{n \log n})}$.

2 Definitions 

The class of infinite-horizon simple stochastic games (SSGs) consists of two-player, zero-sum,
turn-based games, played on a (multi-)graph. The class was first defined by Condon [1]. Below we
define SSGs, the finite-horizon version, and the important sub-class of MDPs.

SSGs, finite-horizon SSGs, and MDPs. An SSG
$G = (S_1, S_2, S_R, \bot, (A_s)_{s \in S_1 \cup S_2 \cup S_R}, s_0)$ consists of a terminal state
$\bot$ and three disjoint sets of non-terminal states, $S_1$ (max states), $S_2$ (min states),
$S_R$ (coin toss states). We will use $S$ to denote the union, i.e., $S = S_1 \cup S_2 \cup S_R$.
For each state $s \in S$, let $A_s$ be a (multi-)set of outgoing arcs of $s$. We will use
$A = \bigcup_s A_s$ to denote the (multi-)set of all arcs. Each state $s \in S$ has two outgoing
arcs. If $a$ is an arc, then $d(a) \in S \cup \{\bot\}$ is the destination of $a$. There is also a
designated start state $s_0 \in S$. The class of finite-horizon simple stochastic games (FSSGs)
also consists of two-player, zero-sum, turn-based games, played on a (multi-)graph. An FSSG
$(G, T)$ consists of an SSG $G$ and a finite time limit (or horizon) $T > 0$. Let $G$ be an SSG
and $T > 0$, then we will write the FSSG $(G, T)$ as $G^T$. Given an SSG $G$ (resp. FSSG $G^T$),
for a state $s$, we denote by $G_s$ (resp. $G_s^T$) the same game as $G$ (resp. $G^T$), except
that $s$ is the start state. The class of infinite (resp. finite) horizon Markov decision
processes (MDPs and FMDPs respectively) is the subclass of SSGs (resp. FSSGs) where
$S_2 = \emptyset$.

Plays and objectives of the players. An SSG $G$ is played as follows. A pebble is moved on to
$s_0$. For $i \in \{1, 2\}$, whenever the pebble is moved on to a state $s$ in $S_i$, then
Player $i$ chooses some arc $a \in A_s$ and moves the pebble to $d(a)$. Whenever the pebble is
moved on to a state $s$ in $S_R$, then an $a \in A_s$ is chosen uniformly at random and the pebble
moves to $d(a)$. If the pebble is moved on to $\bot$, then the game is over. For all $T > 0$ the
FSSG $G^T$ is played like $G$, except that the pebble can be moved at most $T + 1$ times. The
objective of both SSGs and FSSGs is for Player 1 to maximize the probability that the pebble is
moved on to $\bot$ (eventually in SSGs and within $T + 1$ time steps in FSSGs). The objective of
Player 2 is to minimize this probability.
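
The play rules above can be turned into a short backward-induction (Bellman iteration) sketch.
This is only an illustration under stated assumptions: the function name `finite_horizon_values`,
the dictionary encoding of arcs, and the toy game are hypothetical and not taken from the paper,
and we read "the pebble can be moved at most $T + 1$ times" as: once the pebble has been placed
on the start state, $T$ further moves remain, so that the value of $G_s^T$ is `v[T][s]`.

```python
from fractions import Fraction

BOT = "bot"  # hypothetical name for the terminal state

def finite_horizon_values(arcs, kind, horizon):
    """Return v with v[t][s] = value when the pebble sits on s and t moves remain.

    arcs[s] lists the destinations of the outgoing arcs of s (repeats allowed);
    kind[s] is "max", "min" or "coin".
    """
    states = list(arcs)
    # With no moves left, only a pebble that is already on BOT counts as a success.
    v = [{**{s: Fraction(0) for s in states}, BOT: Fraction(1)}]
    for _ in range(horizon):
        prev, cur = v[-1], {BOT: Fraction(1)}
        for s in states:
            succ = [prev[d] for d in arcs[s]]
            if kind[s] == "max":
                cur[s] = max(succ)
            elif kind[s] == "min":
                cur[s] = min(succ)
            else:  # coin toss state: an arc is chosen uniformly at random
                cur[s] = sum(succ) / len(succ)
        v.append(cur)
    return v

# Toy game (an assumption for illustration): one max, one min, one coin toss state.
arcs = {"a": ["c", "b"], "b": ["c", BOT], "c": [BOT, "c"]}
kind = {"a": "max", "b": "min", "c": "coin"}
values = finite_horizon_values(arcs, kind, 3)
```

Exact rational arithmetic (`Fraction`) is used so that ties between actions are detected exactly
rather than up to floating-point error.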

Strategies. Let $S^*$ be the set of finite sequences of states. For all $T$, let
$S^{\leq T} \subseteq S^*$ be the set of sequences of states, which have length at most $T$. A
strategy $\sigma_i$ for Player $i$ in an SSG is a map from $S^* \times S_i$ into $A$, such that
for all $w \in S^*$ and $s \in S_i$ we have $\sigma_i(w \cdot s) \in A_s$. Similarly, a strategy
$\sigma_i$ for Player $i$ in an FSSG $G^T$ is a map from $S^{\leq T} \times S_i$ into $A$, such
that for all $w \in S^{\leq T}$ and $s \in S_i$ we have $\sigma_i(w \cdot s) \in A_s$. In all
cases we denote by $\Pi_i$ the set of all strategies for Player $i$. If
$S_i = \emptyset$, we will let denote the corresponding strategy set. Below we define some special
classes of strategies.

Memory-based, counter-based and Markov strategies. Let $M = \{0, 1\}^*$ be the set of possible
memories. A memory-based strategy $\sigma_i$ for Player $i$ consists of a pair
$(\sigma_u, \sigma_a)$, where

• $\sigma_u$, the memory-update function, is a map from $M \times S$ into $M$

• $\sigma_a$, the next-action function, is a map from $M \times S_i$ into $A$, such that for all
$m \in M$ and $s \in S_i$ we have $\sigma_a(m, s) \in A_s$.

A counter-based strategy is a special case of memory-based strategies, where for all $m \in M$ and
$s, s' \in S$ we have $\sigma_u(m, s) = \sigma_u(m, s')$. That is, the memory can only contain a
counter of some type. We will therefore write $\sigma_u(m, s)$ as $\sigma_u(m)$ for all $m, s$ and
any counter-based strategy $\sigma$. A Markov strategy $\sigma_i$ for Player $i$ is a special case
of strategies where

\[ \forall p, p' \in S^{\leq T} : |p| = |p'| \wedge p_{|p|} = p'_{|p'|} \in S_i \Rightarrow \sigma(p', p'_{|p'|}) = \sigma(p, p_{|p|}). \]

That is, a Markov strategy only depends on the length of the history and the current state. Let IL[ 
be the set of all Markov strategies for Player i. 

Following a strategy. For a strategy $\sigma_i$ for Player $i$ we will say that Player $i$ follows
$\sigma_i$ if for all $n$, given the sequence of states $(p_j)_{j \leq n}$ the pebble has been on
until move $n$ and that $p_n \in S_i$, then Player $i$ chooses $\sigma_i((p_j)_{j < n}, p_n)$. For
a memory-based strategy $\sigma_i$ for Player $i$, we will say that Player $i$ follows $\sigma_i$
if for all $n$, given the sequence of states $(p_j)_{j \leq n}$ the pebble has been on until move
$n$, that $p_n \in S_i$ and that $m^j = \sigma_u(m^{j-1}, p_j)$ and that $m^0 = \emptyset$, then
Player $i$ chooses $\sigma_a(m^n, p_n)$.

Space required by a memory-based strategy. The space usage of a memory-based strategy is the
logarithm of the number of distinct memory states generated by the strategy at any point, if the
player follows that strategy. A memory-based strategy is memoryless if there is only one memory
used by the strategy. For any FSSG $G^T$ with $n$ states it is clear that the set of strategies is
a subset of the memory-based strategies that use memory at most $T \log n$, since for any strategy
$\sigma$ we can construct a memory-based strategy $\sigma'$ by using the memory to store the
sequence of states and then choosing the same action as $\sigma$ would with that sequence of
states. Hence we will also talk about $\epsilon$-optimal memory-based strategies. Also note that
for any FSSG $G^T$ it is clear that the set of Markov strategies is a subset of the set of
counter-based strategies that use space at most $\log T$.

Period of a counter-based strategy. We will distinguish between two kinds of memories for a
counter-based strategy $\sigma$. One kind is only used once (the initial phase) and the other kind
is used arbitrarily many times (the periodic phase). Let $m^0 = \emptyset$ and
$m^i = \sigma_u(m^{i-1})$. Then if $m^i = m^j$ for some $i < j$, we also have that
$m^{i+c} = m^{j+c}$ and $m^i = m^{i+c(j-i)}$ for all $c \geq 0$. Hence if a memory is used twice,
it will be reused again. We will let the number of memories that are only used once be $N$ and the
number of memories used more than once be $p$, which we will call the period. The number $N$ is
mainly important for $\epsilon$-optimal strategies and the period is mainly important for optimal
strategies.
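
As a small illustration of the two phases, the following sketch follows the memory sequence
$m^0, m^1, \ldots$ of a counter-based strategy until a memory repeats and reports $N$ and $p$.
The helper name `transient_and_period` and the example update function are hypothetical; memories
are modelled as binary strings, with the empty string as the initial memory (an assumption).

```python
def transient_and_period(update, m0=""):
    """Follow m0, update(m0), update(update(m0)), ... until a memory repeats.

    Returns (N, p): N memories are visited exactly once (the initial phase)
    and p memories are visited over and over (the periodic phase).
    """
    first_seen = {}
    m, step = m0, 0
    while m not in first_seen:
        first_seen[m] = step
        m = update(m)
        step += 1
    n_once = first_seen[m]          # index at which the cycle is entered
    return n_once, step - n_once

# Hypothetical counter over binary strings: count 0, 1, ..., 5, then jump back to 3.
def update(m):
    value = int(m, 2) if m else 0
    nxt = value + 1 if value < 5 else 3
    return format(nxt, "b")

N, p = transient_and_period(update)
```

Here the memories for 0, 1 and 2 are used once and the memories for 3, 4 and 5 repeat forever.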

Probability measure and values. A pair of strategies $(\sigma_1, \sigma_2)$, one for each player
(in either an SSG or an FSSG), defines a probability that the pebble is eventually moved to
$\bot$. Let the probability be denoted as $P^{\sigma_1, \sigma_2}$. For all SSGs $G$ (resp. FSSGs
$G^T$) it follows from the results of Everett [2] that

\[ \sup_{\sigma_1 \in \Pi_1} \inf_{\sigma_2 \in \Pi_2} P^{\sigma_1, \sigma_2} = \inf_{\sigma_2 \in \Pi_2} \sup_{\sigma_1 \in \Pi_1} P^{\sigma_1, \sigma_2}. \]

We will call this common value the value of $G$ (resp. $G^T$) and denote it $\mathrm{val}(G)$
(resp. $\mathrm{val}(G^T)$).

$\epsilon$-optimal and optimal strategies. For all $\epsilon > 0$, we will say that a strategy
$\sigma_1$ is $\epsilon$-optimal for Player 1 if

\[ \inf_{\sigma_2 \in \Pi_2} P^{\sigma_1, \sigma_2} + \epsilon \geq \sup_{\sigma_1' \in \Pi_1} \inf_{\sigma_2 \in \Pi_2} P^{\sigma_1', \sigma_2}. \]

Similarly, a strategy $\sigma_2$ is $\epsilon$-optimal for Player 2 if

\[ \sup_{\sigma_1 \in \Pi_1} P^{\sigma_1, \sigma_2} - \epsilon \leq \inf_{\sigma_2' \in \Pi_2} \sup_{\sigma_1 \in \Pi_1} P^{\sigma_1, \sigma_2'}. \]

A strategy $\sigma$ is optimal for Player $i$ if it is 0-optimal. Condon [1] showed that there
exist optimal memoryless strategies for any SSG $G$ that are also optimal for $G_s$ for all
$s \in S$. This also implies that there are optimal Markov strategies for FSSGs $G^T$ that are
also optimal for $G_s^T$ for all $s \in S$.

3 Bounds on e-optimal counter-based strategies 

We will first show an upper bound on the size of the memory used by a counter-based strategy for
playing $\epsilon$-optimally in time-limited games. The upper bound on memory size is by
application of a result from Ibsen-Jensen and Miltersen [5]. The idea of the proof is that if we
play an optimal strategy of $G$ in $G^T$ for sufficiently high $T$, then the value we get
approaches the value of $G$.

Theorem 1 (Upper bound) For all FSSGs $G^T$ with $n$ states and $\epsilon > 0$, there is an
$\epsilon$-optimal counter-based strategy for both players such that memory size is at most
$\log\log\epsilon^{-1} + n + 1$.

Proof Since there is an optimal Markov strategy, there is a counter-based strategy, which uses
memory at most $\log T$. As shown by Ibsen-Jensen and Miltersen [5], for any game $G^T$, if the
horizon is greater than $2 \log\epsilon^{-1} \cdot 2^n$, the value of $G^T$ approximates the value
of $G$ within $\epsilon$. It is clear that the values of all states are the same in an
infinite-horizon game if either player is forced to play an optimal strategy. Hence, if
$T \geq 2 \log\epsilon^{-1} \cdot 2^n$ and either player plays an optimal strategy of $G$ in
$G^T$, then the values of all states are within $\epsilon$ of the value of the game. But there are
optimal memoryless strategies in $G$ as shown by Condon [1]. Therefore we have that in the worst
case $T \leq 2 \log\epsilon^{-1} \cdot 2^n$. Since $\log T$ is an upper bound,
$\log\log\epsilon^{-1} + n + 1$ is also an upper bound and hence the result. □
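
The last step of the proof is plain arithmetic. Written out, assuming (as elsewhere in the
section) that all logarithms are base 2:

```latex
\[
\log T \;\le\; \log\bigl(2 \log \epsilon^{-1} \cdot 2^{n}\bigr)
       \;=\; \log 2 + \log\log \epsilon^{-1} + \log 2^{n}
       \;=\; 1 + \log\log \epsilon^{-1} + n .
\]
```

For instance, with the illustrative values $\epsilon = 2^{-16}$ and $n = 10$ the horizon bound is
$2 \cdot 16 \cdot 2^{10} = 2^{15}$, so a counter of $15 = 1 + 4 + 10$ bits suffices.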

We will now lower bound the size of the memory needed for a counter-based strategy to be
$\epsilon$-optimal. Our lower bound will be divided into two parts. The first part will show that
$\log\log\epsilon^{-1}$ is a lower bound on the memory required even for some MDPs with constantly
many states. The second part will show that even for fixed $\epsilon$, an $\epsilon$-optimal
counter-based strategy will need to use a memory of size $\Omega(n)$. Both lower bounds will show
explicit MDPs with the required properties. See Figure [1] and Figure [2] respectively.

Figure 1: An MDP $G$, such that for all $\epsilon > 0$ there is a $T$, such that all
$\epsilon$-optimal memory-based strategies for $G^T$ require memory size of at least
$\Omega(\log\log\epsilon^{-1})$. Circle vertices are the coin toss states. The triangle vertex is
the max state. The vertex $\bot$ is the terminal state.

MDP for the lower bound of $\log\log\epsilon^{-1}$. Our first lower bound shows that in the MDP
$M$ (Figure [1]) all $\epsilon$-optimal memory-based strategies require at least
$\log\epsilon^{-1}$ distinct memory states, i.e., the size of memory is at least
$\log\log\epsilon^{-1}$. The MDP $M$ is defined as follows. There is one state $x$ in $S_1$, the
rest are in $S_R$.

• The state $\top \in S_R$ has $A_\top = \{(\top, \top), (\top, \top)\}$.

• The state $h \in S_R$ has $A_h = \{(h, \top), (h, \bot)\}$.

• The state $1 \in S_R$ has $A_1 = \{(1, \bot), (1, \bot)\}$.

• The state $2 \in S_R$ has $A_2 = \{(2, 1), (2, 1)\}$.

• The state $x \in S_1$ has $A_x = \{(x, 2), (x, h)\}$.

• The state $start \in S_R$ has $A_{start} = \{(start, start), (start, x)\}$.
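
The MDP $M$ is small enough to evaluate by backward induction. The sketch below is an
illustration only: the state names `"top"` and `"bot"`, the helper names, and the indexing
convention (`v[t][s]` is the value when the pebble sits on `s` and `t` further moves remain) are
assumptions of this example. Under that reading it shows that the best arc at $x$ switches from
$(x, h)$ to $(x, 2)$ once more than two moves remain.

```python
from fractions import Fraction

# Arcs of M as listed above; "top" and "bot" stand for the sink and the terminal state.
ARCS = {
    "top": ["top", "top"],
    "h": ["top", "bot"],
    "1": ["bot", "bot"],
    "2": ["1", "1"],
    "x": ["2", "h"],
    "start": ["start", "x"],
}
MAX_STATES = {"x"}

def values(horizon):
    """v[t][s] = probability of reaching bot from s with t moves remaining."""
    v = [{**{s: Fraction(0) for s in ARCS}, "bot": Fraction(1)}]
    for _ in range(horizon):
        prev, cur = v[-1], {"bot": Fraction(1)}
        for s, succ in ARCS.items():
            vals = [prev[d] for d in succ]
            # The max state picks the best arc, coin toss states average their two arcs.
            cur[s] = max(vals) if s in MAX_STATES else sum(vals) / 2
        v.append(cur)
    return v

def best_arcs_at_x(v, t):
    """Destinations of the optimal arcs at x when t >= 1 moves remain."""
    best = max(v[t - 1][d] for d in ARCS["x"])
    return [d for d in ARCS["x"] if v[t - 1][d] == best]

v = values(6)
```

With two moves remaining only the arc to $h$ can still reach the terminal state; with three or
more, the slower but certain path through states 2 and 1 is strictly better.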

Lemma 2 All $\epsilon$-optimal memory-based strategies in $M^T$, for
$T = \log\epsilon^{-1} - 1$, require at least $\log\epsilon^{-1} - 2$ distinct states of memory,
i.e., the size of memory is at least $\log\log\epsilon^{-1}$.

Proof We will first show the proof for counter-based strategies. At the end we will then extend
it to memory-based strategies.

It is clear that $\mathrm{val}(M_x^2) = \frac{1}{2}$ and for all $T > 2$ we have
$\mathrm{val}(M_x^T) = 1$. If Player 1 chooses $(x, h)$ in $M_x^2$, then he gains $\frac{1}{2}$,
otherwise, if he chooses $(x, 2)$, then he gains 0. Also for all $T > 2$, if Player 1 chooses
$(x, 2)$ in $M_x^T$, then he gains 1, otherwise, if he chooses $(x, h)$, then he gains
$\frac{1}{2}$.

In $M_{start}$ we end up at $x$ after precisely $k \geq 2$ moves of the pebble with probability
$2^{-k+1}$. Therefore, by the preceding, any optimal memory-based strategy $\sigma$ must be able to
find out from the memory if $T$ minus the length of the history is greater than 2.

Let $\epsilon > 0$ be given. For simplicity we will assume that $\epsilon = 2^{-k}$ for some
$k > 0$. Let $c = \log\epsilon^{-1}$. Assume now that there is an $\epsilon$-optimal counter-based
strategy $\sigma = (\sigma_u, \sigma_a)$ that uses $c - 3$ states of memory in $M^T$. The pebble
ends up at $x$ after $c - 3$ moves with probability $2^{-(c-3)+1} = 16\epsilon$. Let the sequence
of memories until then be $m^0, m^1, \ldots, m^{c-3}$. Since $\sigma$ was $\epsilon$-optimal we
must have that $\sigma_a(m^{c-3}, x) = (x, h)$. On the other hand for all $i < c - 3$ we must also
have that $\sigma_a(m^i, x) = (x, 2)$. Therefore $m^{c-3}$ differs from $m^i$ for $i < c - 3$. Now
assume that $m^i = m^j$ for $i < j$ and $i, j < c - 3$. But then
$\sigma_u(m^i) = \sigma_u(m^j)$ and hence $m^{i+1} = m^{j+1}$ and then by repeating this argument
we have that $m^k = m^{c-3}$ for some $k < c - 3$. Therefore $m^i$ differs from $m^j$ for
$i \neq j$ and $i, j \leq c - 3$ and hence we need at least $c - 2$ different memory states.

For general memory-based strategies the proof remains the same. This is because we can note
that if the pebble ends up at $x$ after $c - 3$ moves, we have that $m^0 = \emptyset$ and
$m^i = \sigma_u(m^{i-1}, start)$ for $1 \leq i \leq c - 3$ and hence they must all differ by the
same argument as before. □

For our second lower bound we will use an infinite family of MDPs

\[ H = \{H(1), H(2), \ldots, H(i), \ldots\}, \]

such that $H(i)$ contains $2i + 4$ states, one of which is a max state, and all
$\epsilon$-optimal counter-based strategies require space at least $i - 4$, for some fixed
$\epsilon$.

Family of MDPs for the lower bound of $n$. The MDP $H(i)$ is defined as follows. There is one
state $x$ in $S_1$, the rest are in $S_R$.

• The state $\top \in S_R$ has $A_\top = \{(\top, \top), (\top, \top)\}$.

• The state $h \in S_R$ has $A_h = \{(h, \top), (h, \bot)\}$.

• The state $1 \in S_R$ has $A_1 = \{(1, \bot), (1, i)\}$.

• For $j \in \{2, \ldots, i\}$, the state $j \in S_R$ has $A_j = \{(j, i), (j, j - 1)\}$.

• The state $x \in S_1$ has $A_x = \{(x, i), (x, h)\}$.

• The state $1^* \in S_R$ has $A_{1^*} = \{(1^*, i^*), (1^*, x)\}$.

• For $j \in \{2, \ldots, i\}$, the state $j^* \in S_R$ has
$A_{j^*} = \{(j^*, i^*), (j^*, (j - 1)^*)\}$.

There is an illustration of $H(4)$ in Figure [2].

Let $i$ be some number. It is clear that $\mathrm{val}(H(i)_x^2) = \frac{1}{2}$. It is also easy to
see that $\mathrm{val}(H(i)_x) = 1$, but that the time to reach $\bot$ from $i$ is quite long.
Hence, one can deduce that there must be a $k$ ($k$ depends on $i$) such that for all $k' \geq k$
it is an optimal strategy in $H(i)^{k'}$ to choose $(x, i)$ and for all $2 \leq k'' < k$ it is an
optimal strategy in $H(i)^{k''}$ to choose $(x, h)$. In case there are multiple such numbers, let
$k$ be the smallest. The number $k - 1$ is then the smallest number of moves of the pebble to
reach $\bot$ from $i$, such that that occurs with probability $\geq \frac{1}{2}$ (to simplify the
proofs we will assume equality).

Let $p^t$ be the probability for the pebble to reach $x$ from $i^*$ in $t$ or fewer moves (note
that this is also the probability to reach $\bot$ in $t$ moves or fewer from $i$). It is clear that
$p^t$ is equal to the probability that a sequence of $t$ fair coin tosses contains $i$ consecutive
tails. This is known to be exactly $1 - F^{(i)}_{t+2}/2^t$, where $F^{(i)}_{t+2}$ is the
$(t + 2)$'nd Fibonacci $i$-step number, i.e. the number given by the linear homogeneous recurrence
$F^{(i)}_c = \sum_{j=1}^{i} F^{(i)}_{c-j}$ and the boundary conditions $F^{(i)}_c = 0$, for
$c \leq 0$, $F^{(i)}_1 = F^{(i)}_2 = 1$ (this fact is also mentioned in Ibsen-Jensen and
Miltersen [5]).
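
The closed form for $p^t$ can be sanity-checked against brute-force enumeration for small
parameters. The sketch below is illustrative; the function names `fib_step`, `p_formula` and
`p_brute` are hypothetical helpers, not notation from the paper.

```python
from fractions import Fraction
from itertools import product

def fib_step(i, c_max):
    """F[c] for c = 0..c_max, with F[c] = 0 for c <= 0 and F[1] = F[2] = 1."""
    F = [0, 1, 1]
    for c in range(3, c_max + 1):
        # Sum of the i previous terms; indices below 0 contribute 0.
        F.append(sum(F[max(c - i, 0):c]))
    return F[:c_max + 1]

def p_formula(i, t):
    """1 - F_{t+2} / 2^t, the closed form quoted in the text."""
    return 1 - Fraction(fib_step(i, t + 2)[t + 2], 2 ** t)

def p_brute(i, t):
    """Fraction of the 2^t toss sequences that contain i consecutive tails."""
    hits = sum("T" * i in "".join(seq) for seq in product("HT", repeat=t))
    return Fraction(hits, 2 ** t)

# Compare both computations for a few small run lengths and numbers of tosses.
agree = all(p_formula(i, t) == p_brute(i, t)
            for i in range(1, 5) for t in range(0, 11))
```

For $i = 2$ the numbers `fib_step(2, ...)` are the ordinary Fibonacci numbers, and for example
$p^4 = 1 - 8/16 = 1/2$.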
The next lemmas will prove various properties of $p^t$, $F^{(i)}_a$ and $k$. We will first show
two technical lemmas that will be used in many of the remaining lemmas. Next, we will show that
$k$ is exponential in $i$ and show various bounds on $p^t$. We will use all that to show that the
number of states in the game is a lower bound on the memory requirement for $\epsilon$-optimal
counter-based strategies.

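Lemma 3 Let $i$ and $a \geq i + 3$ be given. Then

\[ F^{(i)}_a \leq (2 - 2^{-i-1}) F^{(i)}_{a-1}. \]

Let $b \geq 3$ be given. Then

\[ F^{(i)}_b \leq 2 F^{(i)}_{b-1}. \]

Proof We can see that

\[ F^{(i)}_b = \sum_{j=1}^{i} F^{(i)}_{b-j} = 2 F^{(i)}_{b-1} - F^{(i)}_{b-1-i} \]

for $b \geq 3$. Hence we have that $F^{(i)}_b \leq 2 F^{(i)}_{b-1}$.

We therefore have that $F^{(i)}_{a-i-1} \geq 2^{-i-1} F^{(i)}_{a-1}$ and we can deduce that

\[ F^{(i)}_a = 2 F^{(i)}_{a-1} - F^{(i)}_{a-1-i} \leq (2 - 2^{-i-1}) F^{(i)}_{a-1}. \]

The desired result follows. □

Now for the proof that $k$ is exponential in $i$.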

Lemma 4 For all $i$, we have that $k \geq 2^{i-2} + i$.

Proof We will first show that $p^a \leq p^{a-1} + 2^{-i}$. We can divide the event that there are
$i$ consecutive tails out of $a$ fair coin tosses into two possibilities. Either the first $i$
coin tosses were tails or there are $i$ consecutive tails in the last $a - 1$ coin tosses (or
both). The first case happens with probability $2^{-i}$ and the last with probability $p^{a-1}$.
We can then apply union bounds and get that $p^a \leq p^{a-1} + 2^{-i}$. Clearly we have that
$p^{i-1} = 0$ and that $p^a$ is increasing in $a$. But we also have that

\[ p^{2^{i-2}+i-1} \leq 2^{i-2} \cdot 2^{-i} + p^{i-1} = \frac{1}{4} < \frac{1}{2} = p^k, \]

which means that $k \geq 2^{i-2} + i$. □
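
The union-bound step and the resulting lower bound can be checked numerically for small $i$. In
this sketch, `first_half_time(i)` is a hypothetical stand-in for $k$: it returns the smallest $t$
with $p^t \geq \frac{1}{2}$, which may differ by one from the paper's indexing of $k$. The table
of probabilities is built from the standard count of toss sequences without $i$ consecutive
tails, which is an assumption of this example rather than a formula from the text.

```python
from fractions import Fraction

def p_table(i, t_max):
    """p[t] = Pr[t fair tosses contain i consecutive tails], for t = 0..t_max."""
    # no_run[t] counts the sequences of length t WITHOUT i consecutive tails.
    no_run = []
    for t in range(t_max + 1):
        if t < i:
            no_run.append(2 ** t)
        elif t == i:
            no_run.append(2 ** t - 1)
        else:
            no_run.append(2 * no_run[t - 1] - no_run[t - 1 - i])
    return [1 - Fraction(c, 2 ** t) for t, c in enumerate(no_run)]

def first_half_time(i):
    """Smallest t with p[t] >= 1/2 (searches t up to 2^(i+2))."""
    t_max = 2 ** (i + 2)
    p = p_table(i, t_max)
    return next(t for t in range(t_max + 1) if p[t] >= Fraction(1, 2))

# Compare the threshold time with the bound 2^(i-2) + i for a few run lengths.
report = {i: (first_half_time(i), 2 ** (i - 2) + i) for i in range(2, 9)}
```

For the small cases tried here the threshold time is comfortably above the bound, in line with
$k$ growing exponentially in $i$.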

Lemma 5 Let $i$ be given. The number $k$ is such that

\[ e^{-1/8} \geq (1 - 2^{-i-2})^{k} \geq \frac{1}{4} \]

and such that

\[ e^{-1/16} \geq (1 - 2^{-i-2})^{k-i} \geq \frac{1}{2}. \]

Proof We have that $1 - F^{(i)}_{k+2}/2^k = \frac{1}{2}$, which we can then use to show that

\[ 1 - F^{(i)}_{k+2}/2^k = \frac{1}{2}
   \;\Rightarrow\; \frac{F^{(i)}_{k+2}}{2^k} = \frac{1}{2}
   \;\Rightarrow\; \frac{(2 - 2^{-i-1})^{k-i} F^{(i)}_{i+2}}{2^k} \geq \frac{1}{2}
   \;\Rightarrow\; (1 - 2^{-i-2})^{k-i} \geq \frac{1}{2}, \]

where we used Lemma [3] for the second implication. We used that $F^{(i)}_{i+2} \leq 2^i$ for the
third implication. Since $k \geq 2^{i-2} + i \geq 2i$ by Lemma [4] we also have that
$(1 - 2^{-i-2})^k \geq \frac{1}{4}$.

But we can also use Lemma [4] more directly. Notice that since $i \geq 12$ we have that
$2^{i+2} \geq 72$. We have that

\[ (1 - 2^{-i-2})^{k-i} \leq (1 - 2^{-i-2})^{2^{i-2}} = \bigl((1 - 2^{-i-2})^{2^{i+2}}\bigr)^{1/16} \leq e^{-1/16}, \]

where we used that $\lim_{x \to \infty} (1 - x^{-1})^x = e^{-1}$ and that $(1 - x^{-1})^x$ is
increasing in $x$ for $x \geq 1$. We also have that $e^{-1/8} \geq (1 - 2^{-i-2})^k$, by the same
argument. □



Lemma 6 For all $i$ and $t$, we have

\[ p^{2t-2i} \leq 2 p^t. \]

Proof Let $t' = t - i$. Hence, we need to show that $p^{2t'} \leq 2 p^{t'+i}$. The proof comes
from the fact that to have $i$ consecutive tails out of $2t'$ fair coin tosses, the $i$
consecutive tails must either start in the first half or end in the second half (or both). But to
start in the first half means that it must end in the first $t' + i$ elements. Therefore we can
overestimate that probability with $p^{t'+i}$. Similarly with the second half. We can then add
them together by a union bound and the result follows. □
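
The doubling inequality of Lemma 6 is easy to test numerically. The following sketch is an
illustration; `p_table` is a hypothetical helper that tabulates $p^t$ via the count of toss
sequences without $i$ consecutive tails.

```python
from fractions import Fraction

def p_table(i, t_max):
    """p[t] = Pr[t fair tosses contain i consecutive tails], for t = 0..t_max."""
    no_run = [2 ** t for t in range(i)] + [2 ** i - 1]   # lengths 0..i
    for t in range(i + 1, t_max + 1):
        no_run.append(2 * no_run[t - 1] - no_run[t - 1 - i])
    return [1 - Fraction(c, 2 ** t) for t, c in enumerate(no_run[:t_max + 1])]

def lemma6_holds(i, t_limit):
    """Check p[2t - 2i] <= 2 p[t] for every t with i <= t <= t_limit."""
    p = p_table(i, 2 * t_limit)
    return all(p[2 * t - 2 * i] <= 2 * p[t] for t in range(i, t_limit + 1))

checked = {i: lemma6_holds(i, 40) for i in range(1, 6)}
```

Only $t \geq i$ is tested, since for smaller $t$ the left-hand side refers to a negative number of
tosses.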

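Lemma 7 Let $i \geq 12$ and $\frac{1}{10} \leq d < 1$ be given. Then
$p^{dk} \leq 1 - \frac{e^{(1-d)/8}}{2} < \frac{1}{2}$.

Proof Since $d \geq \frac{1}{10}$, we have that $dk \geq i$, by Lemma [4] and because
$i \geq 12$. We will show that $F^{(i)}_{dk+2}/2^{dk} \geq \frac{e^{(1-d)/8}}{2}$. We have that

\[ \begin{aligned}
F^{(i)}_{dk+2}/2^{dk}
  &\geq \frac{F^{(i)}_{k+2}}{(2 - 2^{-i-1})^{(1-d)k} \, 2^{dk}} \\
  &= \frac{F^{(i)}_{k+2}}{(1 - 2^{-i-2})^{(1-d)k} \, 2^{k}} \\
  &= \frac{1}{2 \cdot (1 - 2^{-i-2})^{(1-d)k}} \\
  &= \frac{1}{2 \cdot \bigl((1 - 2^{-i-2})^{k}\bigr)^{1-d}} \\
  &\geq \frac{1}{2 \cdot (e^{-1/8})^{1-d}} \\
  &= \frac{e^{(1-d)/8}}{2},
\end{aligned} \]

where we used Lemma [3] for the first inequality, Lemma [5] for the second and that
$\lim_{x \to \infty} (1 - x^{-1})^x = e^{-1}$ and that $(1 - x^{-1})^x$ is increasing in $x$ for
$x \geq 1$ for the third. □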



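Lemma 8 Let $i \geq 12$ and $0 < d$ be given. Then
$p^{(1+d)k} \geq 1 - \frac{e^{-d/8}}{2} > \frac{1}{2}$.

Proof We will show that $F^{(i)}_{(1+d)k+2}/2^{(1+d)k} \leq (e^{-1/8})^{d} \cdot \frac{1}{2}$. We
have that

\[ \begin{aligned}
F^{(i)}_{(1+d)k+2}/2^{(1+d)k}
  &\leq \frac{(2 - 2^{-i-1})^{dk} F^{(i)}_{k+2}}{2^{(1+d)k}} \\
  &= (1 - 2^{-i-2})^{dk} \cdot \frac{1}{2} \\
  &= \bigl((1 - 2^{-i-2})^{k}\bigr)^{d} \cdot \frac{1}{2} \\
  &\leq (e^{-1/8})^{d} \cdot \frac{1}{2},
\end{aligned} \]

where we used Lemma [3] for the first inequality and Lemma [5] for the second. □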



Lemma 9 There is an $\epsilon$ such that for all $i \geq 12$, there is a time-bound $T$ such that
all $\epsilon$-optimal counter-based strategies for $H(i)^T$ require memory size at least $i - 5$.

The proof basically goes as follows: The pebble starts at $i^*$ with $2k + 1$ moves remaining.
First we show that there is a super-constant probability for the pebble to reach $x$ using
somewhere between $\frac{k}{5}$ and $\frac{4k}{5}$ moves. In that case there are at least
$\frac{6k}{5} + 1$ moves left. We then show that there is some number $p > \frac{1}{2}$
independent of $i$ such that the probability to reach $\bot$ from $i$ in $\frac{6k}{5}$ moves is
more than $p$. Secondly we show that there is a super-constant probability for the pebble to reach
$x$ using somewhere between $\frac{6k}{5}$ and $\frac{9k}{5}$ moves. In that case there are at
most $\frac{4k}{5} + 1$ moves left. We then show that there is some number $q < \frac{1}{2}$
independent of $i$ such that the probability to reach $\bot$ from $i$ in $\frac{4k}{5}$ moves is
less than $q$. We can then pick $\epsilon$ such that any $\epsilon$-optimal strategy must
distinguish between plays that used between $\frac{k}{5}$ and $\frac{4k}{5}$ moves to reach $x$
from $i^*$ and plays that used between $\frac{6k}{5}$ and $\frac{9k}{5}$ moves to reach $x$ from
$i^*$. We then show that that requires at least $\Omega(k)$ distinct states of memory, and the
result then follows from $k$ being exponential in $i$, by Lemma [4].

Figure 2: The MDP $H(4)$. It is the fourth member of a family that will show that there exist
FSSGs where, for a fixed $\epsilon$, all $\epsilon$-optimal counter-based strategies require memory
size to be at least $\Omega(n)$. Circle vertices are the coin toss states. The triangle vertex is
the max state. The vertex $\bot$ is the terminal state.
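Proof The probability for the pebble to reach $x$ using somewhere between $\frac{k}{5}$ and
$\frac{4k}{5}$ moves is

\[ \begin{aligned}
p^{4k/5} - p^{k/5}
  &= 1 - F^{(i)}_{4k/5+2}/2^{4k/5} - \bigl(1 - F^{(i)}_{k/5+2}/2^{k/5}\bigr) \\
  &= \frac{2^{3k/5} F^{(i)}_{k/5+2} - F^{(i)}_{4k/5+2}}{2^{4k/5}} \\
  &\geq \frac{2^{3k/5} F^{(i)}_{k/5+2} - (2 - 2^{-i-1})^{3k/5} F^{(i)}_{k/5+2}}{2^{4k/5}} \\
  &= \frac{\bigl(2^{3k/5} - (2 - 2^{-i-1})^{3k/5}\bigr) F^{(i)}_{k/5+2}}{2^{4k/5}} \\
  &= \frac{\bigl(1 - (1 - 2^{-i-2})^{3k/5}\bigr) F^{(i)}_{k/5+2}}{2^{k/5}} \\
  &= \bigl(1 - ((1 - 2^{-i-2})^{k})^{3/5}\bigr)(1 - p^{k/5}) \\
  &\geq (1 - e^{-3/40}) \cdot \frac{e^{1/10}}{2},
\end{aligned} \]

where we used Lemma [3] for the first inequality and Lemma [5] and Lemma [7] for the second.

In this case we have at least $\frac{6k}{5} + 1$ moves left. Therefore if the player chooses to
move to $i$, there are at least $\frac{6k}{5}$ moves left. In that case, by Lemma [8] the pebble
will reach $\bot$ with probability at least $1 - \frac{e^{-1/40}}{2} > \frac{1}{2}$. In both cases
we see that the probability is strictly separated from $\frac{1}{2}$.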

The probability for the pebble to reach $x$ using somewhere between $\frac{6k}{5}$ and
$\frac{9k}{5}$ moves can be calculated similarly to the case between $\frac{k}{5}$ and
$\frac{4k}{5}$ moves. We end up with

\[ p^{9k/5} - p^{6k/5} \geq (1 - e^{-3/40})(1 - p^{6k/5}). \]

Hence, we need an upper bound on $p^{6k/5}$, which is smaller than 1 and does not depend on $k$ or
$i$. We can get that by noting that $\frac{6k}{5} \leq \frac{8k}{5} - 2i$, because of Lemma [4]
and that $i \geq 12$. Hence we can apply Lemma [6] followed by Lemma [7] and get that

\[ p^{6k/5} \leq p^{8k/5 - 2i} \leq 2 p^{4k/5} \leq 2\Bigl(1 - \frac{e^{1/40}}{2}\Bigr) < 1. \]

In this case we have at most $\frac{4k}{5} + 1$ moves left. Therefore if the player chooses to
move to $i$, there are at most $\frac{4k}{5}$ moves left. In that case, by Lemma [7] the pebble
will reach $\bot$ with probability at most $1 - \frac{e^{1/40}}{2} < \frac{1}{2}$.

Let $\sigma$ be some $\epsilon$-optimal counter-based strategy and assume that $\sigma$ uses less
than $\frac{k}{5} - 1$ states. We will show that if $\epsilon$ is some sufficiently low constant,
we get a contradiction and hence all $\epsilon$-optimal counter-based strategies use at least
$\frac{k}{5}$ states. Our result then follows from Lemma [4].

Let $m^0 = \emptyset$ and $m^j = \sigma_u(m^{j-1})$. Since $\sigma$ uses less than $\frac{k}{5}$
states, then $m^a = m^b$ for some $a < b \leq \frac{k}{5}$. Hence also $m^{a+c} = m^{b+c}$ for all
$c \geq 0$, by definition. But then $m^{a+c} = m^{a+c+(b-a)d}$ for all $c$ and $d$ greater than 0.
Hence, we can make a one to one map between memory $m^a$ for
$a \in A = \{\frac{k}{5}, \ldots, \frac{4k}{5}\}$ and some memory $m^b$ for
$b \in B = \{\frac{6k}{5}, \ldots, \frac{9k}{5}\}$, such that $m^a = m^b$, except for up to
$\frac{k}{5}$ of them, which is smaller than a third of the size of both $A$ and $B$.

Let $q^t$ be the probability to reach $x$ from $i^*$ using exactly $t$ moves of the pebble. For
$t \geq i + 1$ we have that

\[ q^t = p^t - p^{t-1} = \frac{2 F^{(i)}_{t+1} - F^{(i)}_{t+2}}{2^t} = \frac{F^{(i)}_{t+1-i}}{2^t} = 2^{-i-1}(1 - p^{t-1-i}). \]

(To have a sequence of $i$ tails after precisely $t$ coin flips for $t > i$, we need to have failed
to get that many tails in a row for the first $t - 1 - i$ coin flips and then gotten a head
followed by $i$ tails, which is also what our expression tells us.)
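
The identity for $q^t$ can be confirmed by exhaustive enumeration for small $i$ and $t$. This
sketch is illustrative; the helper names `sequences`, `p` and `q` are hypothetical, and coin
tosses are modelled as strings over the letters H and T.

```python
from fractions import Fraction
from itertools import product

def sequences(t):
    """All 2^t toss sequences of length t, as strings of 'H' and 'T'."""
    return ("".join(s) for s in product("HT", repeat=t))

def p(i, t):
    """Probability that t fair tosses contain i consecutive tails."""
    return Fraction(sum("T" * i in s for s in sequences(t)), 2 ** t)

def q(i, t):
    """Probability that the first run of i tails is completed exactly at toss t."""
    run = "T" * i
    hits = sum(s.endswith(run) and run not in s[:-1] for s in sequences(t))
    return Fraction(hits, 2 ** t)

# q^t = p^t - p^(t-1) = 2^(-i-1) * (1 - p^(t-1-i)) for t >= i + 1.
identity_holds = all(
    q(i, t) == p(i, t) - p(i, t - 1) == Fraction(1, 2 ** (i + 1)) * (1 - p(i, t - 1 - i))
    for i in range(1, 4) for t in range(i + 1, 12)
)
```

The enumeration is exponential in $t$, so it is only meant for small sanity checks like this one.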

We see that $q^t$ is decreasing for $t \geq i + 1$, because $p^t$ is increasing. We can therefore
calculate the probability to end up at $x$ using a specific amount of time compared to all other
times in $A$ as

\[ \begin{aligned}
\frac{q^{4k/5}}{q^{k/5}}
  &= \frac{2^{-i-1}(1 - p^{4k/5-1-i})}{2^{-i-1}(1 - p^{k/5-1-i})} \\
  &= \frac{F^{(i)}_{4k/5+1-i} / 2^{4k/5-1-i}}{F^{(i)}_{k/5+1-i} / 2^{k/5-1-i}} \\
  &\geq \frac{(2 - 2^{-i-1})^{3k/5}}{2^{3k/5}} \\
  &= (1 - 2^{-i-2})^{3k/5} \\
  &= \bigl((1 - 2^{-i-2})^{k}\bigr)^{3/5} \\
  &\geq e^{-3/40},
\end{aligned} \]

where we used Lemma [3] for the first inequality and Lemma [5] for the second.

We can show similarly that all $q^t$ for $t$ being in $B$ are also equal up to a factor of
$e^{3/40}$. Hence, the probability to reach $x$ from $i^*$ with $t$ time remaining for
$t - 1 \in A$ is nearly uniformly distributed over $A$ (up to a factor of $e^{3/40}$). Similarly
with $t - 1$ in $B$. Therefore we can pick an $\epsilon_1$ (independent of $i$) such that
$\sigma_a(m^t, x) = (x, h)$ for all but a small constant fraction of the $t$'s in $A$. Similarly,
we can pick an $\epsilon_2$ (independent of $i$) such that $\sigma_a(m^t, x) = (x, i)$ for all but
a small constant fraction of the $t$'s in $B$.

By using $\epsilon = \min(\epsilon_1, \epsilon_2)$, most of all $t$ in $A$ have that
$\sigma_a(m^t, x) = (x, h)$ and most of all $t$ in $B$ have that $\sigma_a(m^t, x) = (x, i)$. But
this contradicts that we had a one to one map that mapped at least two thirds of all $m^a$ for $a$
in $A$ to some memory $m^b$ for $b$ in $B$ such that $m^a = m^b$ (and at least two thirds of the
$b$'s got mapped to).

Hence all $\epsilon$-optimal counter-based strategies use memory at least $\frac{k}{5}$. The
result then follows from $k \geq 2^{i-2} + i$ from Lemma [4]. □

Theorem 10 (Lower bound) For all sufficiently small $\epsilon > 0$ and all $n \geq 5$, there is an
FMDP with $n$ states, where all $\epsilon$-optimal counter-based strategies require memory size at
least $\Omega(\log\log\epsilon^{-1} + n)$.

Proof The proof is a simple combination of the two lower bounds in Lemma [2] and Lemma [9]. □

Figure 3: The MDP $G_5$. Circle vertices are the coin toss states. The triangle vertex is the max
state. The vertex $\bot$ is the terminal state.

4 A lower bound on the period of optimal strategies in MDPs



We will in this section show that there exist FMDPs $G$, with $n$ states, such that all optimal
strategies can be implemented using a counter-based strategy, and the period is greater than
$2^{\Omega(\sqrt{n \log n})}$. We will create such FMDPs in two steps. First we will construct a
family, such that the $i$'th member requires that one state uses one action every $\Theta(i)$
steps and in all other steps uses the other action. There is an illustration of a member of that
family in Figure [3]. Afterwards we will play many such games in parallel, which will ensure that
a large period is needed for all optimal strategies. There is an illustration of such a game in
Figure [4].

Let $G_p$, $p \in \{2, 3, \ldots\}$ be the following FMDP, with $2p - 1$ coin toss states and one
max state. The coin toss states are divided into the sets $\{1^*, 2^*, \ldots, (p - 1)^*\}$ and
$\{1, 2, \ldots, p\}$. To simplify the following description let state $0^*$ denote the terminal
state $\bot$. A description of $G_p$ is then

• State $i^*$ has state $(i - 1)^*$ as both its successors.

• State $i$ has state $(i - 1)^*$ and $(i - 1)$ as successors, except state 1 which has $\bot$ and
state $p$ as successors.

• The max state has 1 and 2 as successors.

There is an illustration of $G_5$ in Figure [3].

Lemma 11 Let $p \geq 2$ be given. State $i$ has value $1 - 2^{-f_i(k)}$ in $G_p^k$ for $k > 0$,
where $f_i(k)$ is the function $f_i(k) = \max_{k' \leq k \wedge k' \bmod p = i}(k', 0)$.

Proof It is easily seen by induction that $i^*$ has value 1 in $G_p^k$ for $k \geq i$. Note that
$f_i(k) = k$ for $k \bmod p = i$. The proof will be by induction in $k$. There will be one base
case and two induction cases, one for $1 < k \leq p$ and one for $k > p$. It is easy to see that
state 1 has value $\frac{1}{2} = 1 - \frac{1}{2} = 1 - 2^{-f_1(1)}$ in $G_p^1$ and state $j$ for
$j \neq 1$ has value 0. That settles the base case.

For $1 < k \leq p$. Neither of the successors of state $j$, for $j \neq k$, has changed value from
$G_p^{k-2}$ to $G_p^{k-1}$. For state $k$, both its successors have changed value. The value of
state $(k - 1)^*$ has become $\mathrm{val}(G_p^{k-1})_{(k-1)^*} = 1$ and the value of state
$k - 1$ has become $\mathrm{val}(G_p^{k-1})_{k-1} = 1 - 2^{-f_{k-1}(k-1)}$. The value of state $k$
is then

\[ \mathrm{val}(G_p^k)_k = \frac{1 + 1 - 2^{-f_{k-1}(k-1)}}{2} = \frac{1 + 1 - 2^{-(k-1)}}{2} = 1 - 2^{-k} = 1 - 2^{-f_k(k)}. \]

For $p < k$. Let $i$ be $k \bmod p$. Neither of the successors of state $j$, for $j \neq i$, has
changed value from $G_p^{k-2}$ to $G_p^{k-1}$. The value of state $i' = (i - 1) \bmod p$ in
iteration $k - 1$ is $\mathrm{val}(G_p^{k-1})_{i'} = 1 - 2^{-f_{i'}(k-1)}$. The value of state $i$
is then

\[ \mathrm{val}(G_p^k)_i = \frac{1 + 1 - 2^{-f_{i'}(k-1)}}{2} = \frac{1 + 1 - 2^{-(k-1)}}{2} = 1 - 2^{-(k-1)-1} = 1 - 2^{-f_i(k)}. \]

The desired result follows. □
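
Lemma 11 can be checked mechanically by running backward induction on $G_p$ and comparing with
the closed form. The sketch below is an illustration with hypothetical helper names. It makes two
assumptions of its own: `v[k][i]` is the value of coin toss state $i$ when $k$ moves remain, and
for state $p$ the condition $k' \bmod p = i$ is read as $k' \equiv i \pmod{p}$ with $k' \geq 1$.

```python
from fractions import Fraction

def gp_values(p, horizon):
    """v[k][i] = value of coin toss state i of G_p when k moves remain."""
    # star[k][j] is the value of state j*, where state 0* is the terminal state.
    star = [[Fraction(1)] + [Fraction(0)] * (p - 1)]
    v = [[None] + [Fraction(0)] * p]                  # index 0 is unused
    for _ in range(horizon):
        ps, pv = star[-1], v[-1]
        star.append([Fraction(1)] + [ps[j - 1] for j in range(1, p)])
        cur = [None, (Fraction(1) + pv[p]) / 2]       # state 1: terminal state and state p
        cur += [(ps[i - 1] + pv[i - 1]) / 2 for i in range(2, p + 1)]
        v.append(cur)
    return v

def f(i, k, p):
    """Largest k' in 1..k with k' congruent to i modulo p, or 0 if there is none."""
    return max((kk for kk in range(1, k + 1) if kk % p == i % p), default=0)

def lemma11_holds(p, horizon):
    v = gp_values(p, horizon)
    return all(v[k][i] == 1 - Fraction(1, 2 ** f(i, k, p))
               for k in range(1, horizon + 1) for i in range(1, p + 1))

checked = {p: lemma11_holds(p, 25) for p in (2, 3, 5, 7)}
```

The comparison uses exact fractions, so it tests the formula itself rather than a floating-point
approximation of it.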

The idea behind the construction of $G_p$ is that to find the state of the largest value among 1
and 2, in $G_p^T$, for $p \geq 2$ and $T \geq 1$, we need to know if $T \bmod p = 1$ or not. Let
$p_i$ be the $i$'th smallest prime number. The FMDP $F_k$ is as follows: $F_k$ consists of a copy
of $G_{p_i}$ for $i \in \{1, \ldots, k\}$. Let the max state in that copy of $G_{p_i}$ be $m_i$.
There is an illustration of $F_2$ in Figure [4].

Figure 4: The FMDP $F_2$. Circle vertices are the coin toss states. Triangle vertices are the max
states. The vertex $\bot$ is the terminal state.

We will now show that all optimal strategies for $F_k$ are subsets of counter-based strategies
with a period defined by $k$. Afterwards we will show that the number of states in $F_k$ can also
be expressed in terms of $k$. At the end we will use those two lemmas to get to our result.

Lemma 12 Any optimal strategy $\sigma(k, T)$ in $F_k^T$ is a finite memory counter-based strategy
with period $P = \prod_{i \in \{1, \ldots, k\}} p_i$, where $p_i$ is the $i$'th smallest prime
number.

Proof Let $i$ be some number in $\{1, \ldots, k\}$. The lone optimal choice for $m_i$ and $T > 0$
is to use the action that goes to state 1 in $G_{p_i}$ if $T \bmod p_i = 1$ and otherwise to use
the action that goes to state 2 in $G_{p_i}$, by Lemma [11]. Hence, by the Chinese remainder
theorem there are precisely $P$ steps between each time any optimal strategy uses the action that
goes to 1 in all $m_i$'s. That is, any optimal strategy must do the same action at least every $P$
steps. Furthermore it is also easy to see that any optimal strategy must do the same at most every
$P$ steps, by noting that $(T + P) \bmod p_i$ is 1 if and only if $T \bmod p_i$ is 1 and again
applying Lemma [11]. A strategy that does the same every $P$ steps can be expressed by a
counter-based strategy with period $P$, which also uses memory at most $P$. □
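
The Chinese-remainder step can be illustrated by listing the joint choices at the max states
$m_1, \ldots, m_k$ and measuring how often the pattern repeats. The sketch below takes the rule
from the proof (go to state 1 exactly when $T \bmod p_i = 1$) as given; the helper names are
hypothetical, and the window of three periods used to detect the minimal period is a choice of
this example.

```python
from math import prod

def first_primes(k):
    """The k smallest primes, by trial division."""
    primes, n = [], 2
    while len(primes) < k:
        if all(n % q for q in primes):
            primes.append(n)
        n += 1
    return primes

def joint_actions(primes, T):
    """Successor (1 or 2) chosen at each max state m_i when the time parameter is T."""
    return tuple(1 if T % p == 1 else 2 for p in primes)

def minimal_period(primes, window):
    """Smallest shift P under which the action sequence for T = 1..window repeats."""
    seq = [joint_actions(primes, T) for T in range(1, window + 1)]
    for P in range(1, window):
        if all(seq[t] == seq[t + P] for t in range(window - P)):
            return P
    return None

primes = first_primes(3)                       # 2, 3, 5
period = minimal_period(primes, 3 * prod(primes))
```

For the three smallest primes the joint pattern repeats only after $2 \cdot 3 \cdot 5$ steps,
even though each individual max state repeats after 2, 3 or 5 steps.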

Lemma 13 The number of states in $F_k$ is $2 \sum_{i \in \{1, \ldots, k\}} p_i$.

Proof For any $i$, $G_{p_i}$ consists of $2 p_i$ states. $F_k$ therefore consists of
$2 \sum_{i \in \{1, \ldots, k\}} p_i$ states. □
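
To get a feeling for how the period from Lemma 12 compares with the number of states from
Lemma 13, the two quantities can be tabulated for small $k$. The sketch below is illustrative;
the helper names are hypothetical and the printed ratio is only an informal indicator of the
$2^{\Omega(\sqrt{n \log n})}$ growth, not a proof of it.

```python
from math import log2, prod, sqrt

def first_primes(k):
    """The k smallest primes, by trial division."""
    primes, n = [], 2
    while len(primes) < k:
        if all(n % q for q in primes):
            primes.append(n)
        n += 1
    return primes

def size_and_period(k):
    """(number of states n, period P) of F_k according to Lemmas 13 and 12."""
    primes = first_primes(k)
    return 2 * sum(primes), prod(primes)

if __name__ == "__main__":
    for k in (2, 4, 8, 16, 32):
        n, P = size_and_period(k)
        # Ratio of log2(P) to sqrt(n log2 n); the theorem predicts it stays bounded below.
        print(k, n, P.bit_length(), round(log2(P) / sqrt(n * log2(n)), 3))
```

The number of states grows like a sum of primes while the period grows like their product, which
is the source of the sub-exponential gap.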

Theorem 14 There are FMDPs $G$, with $n$ states, where all optimal strategies are finite memory
counter-based strategies with period $2^{\Omega(\sqrt{n \log n})}$.

Proof Let $n$ be such that there exists a game $F_k$ with $n$ states. Note that for any number
there is always a larger number, $a$, such that $F_b$ has $a$ states for some $b$.

By Lemma [13] we have that $n = 2 \sum_{i \in \{1, \ldots, k\}} p_i$. By the prime number theorem
(see e.g. Newman [6]) we have that
$\sum_{i \in \{1, \ldots, k\}} p_i = \sum_{i \in \{1, \ldots, k\}} O(i \log i) = O(k^2 \log k)$.

Let $f(x) = x^2 \log x$ for $x \geq 1$. The function $f(x)$ is strictly monotone increasing and
hence has an inverse function. Let that function be $f^{-1}(y)$. We have that
$f^{-1}(y) \geq \sqrt{\frac{y}{\log y}}$, for $y \geq 2$, because

\[ \begin{aligned}
f^{-1}(y) \geq \sqrt{\frac{y}{\log y}}
  &\Leftarrow f(f^{-1}(y)) \geq f\Bigl(\sqrt{\frac{y}{\log y}}\Bigr) \\
  &\Leftarrow y \geq \Bigl(\sqrt{\frac{y}{\log y}}\Bigr)^{2} \log\Bigl(\sqrt{\frac{y}{\log y}}\Bigr) \\
  &\Leftarrow y \geq \frac{y}{\log y} \log\Bigl(\sqrt{\frac{y}{\log y}}\Bigr) \\
  &\Leftarrow y \geq \frac{y}{\log y} \log y \\
  &\Leftarrow y \geq y.
\end{aligned} \]

Here, the first $\Leftarrow$ follows by taking $f^{-1}$ on both sides. The function $f^{-1}$ is
strictly monotone increasing, because $f(x)$ was. The fourth $\Leftarrow$ follows from
$y \geq \sqrt{\frac{y}{\log y}}$ for $y \geq 2$ and $\log$ being monotone increasing.

Therefore, let $g(k) = 2 \sum_{i \in \{1, \ldots, k\}} p_i$, then
$g^{-1}(n) = \Omega(\sqrt{\frac{n}{\log n}})$. By Lemma [12] we have that the period is
$\prod_{i \in \{1, \ldots, k\}} p_i$. Trivially we have that

\[ \prod_{i \in \{1, \ldots, k\}} p_i \geq \prod_{i \in \{1, \ldots, k\}} i = k! = 2^{\Omega(k \log k)}. \]

We now insert $\Omega(\sqrt{\frac{n}{\log n}})$ in place of $k$ and get

\[ \prod_{i \in \{1, \ldots, k\}} p_i = 2^{\Omega(\sqrt{n \log n})}. \]

The result follows. □



5 Conclusion 

In the present paper we have considered properties of finite-horizon Markov decision processes and
simple stochastic games. The $\epsilon$-optimal strategies considered in Section [3] indicate the
hardness of playing such games with a short horizon. The concept of period from Section [5]
indicates the hardness of playing such games with a long horizon. Along with our lower bound from
Section [4] we conjecture the following:

Conjecture 15 All FSSGs have an optimal strategy, which is a finite memory counter-based
strategy, with period at most $2^n$.